
Drastic performance improvements for reads (#249) #342

Merged (12 commits into adamreeve:master, Dec 11, 2024)

Conversation

@johannesloibl (Contributor) commented Dec 9, 2024

Possible solution for #249.

See my comment here.
Partially reading a file with thousands of signals can take a huge amount of time (reading 100k signals in slices took 7000 s).

I managed to greatly improve the performance, by at least 12x (on 100k signals) and up to 25x (on 10k signals), through:

  • caching (a minimal sketch of the idea follows this list)
  • skipping unneeded iterations over all data objects
  • skipping unneeded file seeks
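For illustration, here is a minimal sketch of the kind of caching and early-exit lookup meant above. The class and attribute names (`SegmentIndex`, `_object_index_cache`, `obj.path`) are hypothetical and are not the actual npTDMS internals.

```python
# Illustrative sketch only, not the npTDMS implementation: cache where a
# channel's data object sits inside a segment so that repeated partial reads
# of the same channel do not rescan the whole object list every time.
class SegmentIndex:
    def __init__(self, data_objects):
        self._data_objects = data_objects
        self._object_index_cache = {}  # hypothetical: channel path -> list index

    def find_object(self, channel_path):
        cached_index = self._object_index_cache.get(channel_path)
        if cached_index is not None:
            return self._data_objects[cached_index]
        for index, obj in enumerate(self._data_objects):
            if obj.path == channel_path:
                self._object_index_cache[channel_path] = index
                return obj  # stop as soon as the channel is found
        raise KeyError(channel_path)
```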

What is still a problem: the reading time still scales as O(n²) with the number of channels and chunks, because a channel has to be found by iterating through the list of data objects AND chunks.
If I read only 10k signals, the improvement factor is currently ~25x; for 100k signals it's only ~12x.

You can see that most of the time is now spent in the _read_channel_data_chunk and _get_channel_number_values functions. Once the channel is found, the search is ended (before, it kept iterating to the end). This means that with many signals, the time to find the desired signal still grows.
We need to find a way to calculate the channel's position directly in order to greatly speed this up,
but I don't know the internals of the chunking well enough to do this.
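
A back-of-the-envelope model of why the scaling stays roughly quadratic (my own illustration, not npTDMS code): if the i-th channel is only found after scanning about i data objects, reading every channel once costs on the order of n²/2 scans in total, which is consistent with the improvement factor shrinking as the channel count grows.

```python
# Rough cost model: finding the i-th channel scans ~i objects, so one partial
# read per channel costs about n*(n+1)/2 object scans in total.
def total_scans(n_channels):
    return n_channels * (n_channels + 1) // 2

print(total_scans(10_000))   # 50,005,000
print(total_scans(100_000))  # 5,000,050,000 -> 10x more channels, ~100x more work
```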

Before: [profiling screenshot]

With my changes: [profiling screenshot]

@johannesloibl marked this pull request as ready for review on December 9, 2024, 22:25
@adamreeve (Owner) left a comment

Thanks for the contribution @johannesloibl, this looks great. I've left some review comments to be addressed

Review comments on nptdms/tdms_segment.py (resolved)
@adamreeve (Owner) commented Dec 10, 2024

> Partially reading a file with thousands of signals can take a huge amount of time (reading 100k signals in slices took 7000 s).

Just to add some more context and make sure I'm understanding this right, can you provide an example of how you're reading a file? Are you reading a subset of the channels and for each channel, reading all data at once?

@johannesloibl (Contributor, Author) commented Dec 10, 2024

> Partially reading a file with thousands of signals can take a huge amount of time (reading 100k signals in slices took 7000 s).
>
> Just to add some more context and make sure I'm understanding this right, can you provide an example of how you're reading a file? Are you reading a subset of the channels and for each channel, reading all data at once?

We are reading subsets of each channel. The data is lab measurement data.
For each of N (~1-20) different operating conditions, either a scalar measurement or an array is appended to each TDMS channel. So the final length of an array channel with 1000 points per iteration and 100 iterations would be 100k. We have M (~1-100) channels.
We now want to read out the data for each operating condition separately, which means that for each operating condition we read M channels partially. This is where the number of read operations explodes.
I don't want to read all channels fully at once, because I want the reader to also work in memory-restricted, scalable cloud environments.
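
For context, a minimal example of this read pattern using npTDMS's streaming API (TdmsFile.open plus slice reads); the file path, block size, and condition count below are made up for illustration.

```python
from nptdms import TdmsFile

BLOCK_SIZE = 1000    # points appended per operating condition (illustrative)
N_CONDITIONS = 100   # number of operating conditions (illustrative)

# Stream the file instead of loading it fully, so memory stays bounded
# even in restricted cloud environments.
with TdmsFile.open("measurements.tdms") as tdms_file:  # path is illustrative
    for group in tdms_file.groups():
        for channel in group.channels():
            for condition in range(N_CONDITIONS):
                start = condition * BLOCK_SIZE
                # One small partial read per channel and operating condition;
                # it is the sheer number of these reads that made the old
                # lookup cost explode.
                block = channel[start:start + BLOCK_SIZE]
```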

If you open this file with the TDMS Excel plugin, the cover page has 50k rows :D
[screenshot of the TDMS Excel plugin cover page]

@johannesloibl (Contributor, Author) commented

Do you see a way to skip iterating through the data_objects and instead calculate the desired file offset directly?
This would be another major speed improvement for partial reads.

Review comment on nptdms/base_segment.py (resolved)
@adamreeve (Owner) commented

> Do you see a way to skip iterating through the data_objects and instead calculate the desired file offset directly?
> This would be another major speed improvement for partial reads.

It should be possible to compute them when reading the segment metadata, although that might add a bit more memory overhead. It also wouldn't be needed for the case where you read all data up front, so we'd probably want to disable that behaviour in that case.
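
An illustrative sketch of that idea (my own sketch, not the actual implementation): once the segment metadata is read, a running sum over the objects' raw data sizes gives each channel's byte offset inside the segment's data block, assuming a contiguous (non-interleaved) layout. The attribute names below are hypothetical.

```python
# Illustrative sketch, not npTDMS code: precompute each object's byte offset
# within a segment's raw data so a partial read can seek straight to a channel
# instead of iterating over all preceding data objects.
def compute_data_offsets(data_objects):
    """Map each object's path to its byte offset in the segment's raw data.

    Assumes contiguous (non-interleaved) data and that every object exposes a
    ``path`` and a ``data_size`` in bytes (hypothetical attribute names).
    """
    offsets = {}
    running_offset = 0
    for obj in data_objects:
        offsets[obj.path] = running_offset
        running_offset += obj.data_size
    return offsets
```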

@johannesloibl (Contributor, Author) commented

Ready to merge from my point of view ;) Thanks for the quick support!
Glad my runtime dropped from 1.5 h to 4 min :D

@adamreeve merged commit 7d4e0b9 into adamreeve:master on Dec 11, 2024 (14 checks passed)
@adamreeve (Owner) commented

Thanks for the contribution!
